
Deduplicate validation results yield from validate.validate()#1619

Merged
yarikoptic merged 1 commit into dandi:master from candleindark:unique-errors
Apr 25, 2025

Conversation

@candleindark
Member

@candleindark candleindark commented Apr 24, 2025

This PR closes #1617.

This solution is a bit awkward in my opinion, but it fixes the problem without an overhaul of how validation is done in dandi-cli.

@codecov

codecov bot commented Apr 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.43%. Comparing base (4f466a8) to head (1fb98b0).
Report is 72 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1619      +/-   ##
==========================================
+ Coverage   88.28%   88.43%   +0.15%     
==========================================
  Files          78       78              
  Lines       11038    11045       +7     
==========================================
+ Hits         9745     9768      +23     
+ Misses       1293     1277      -16     
Flag        Coverage            Δ
unittests   88.43% <100.00%>    +0.15% ⬆️


@candleindark candleindark requested a review from Copilot April 24, 2025 02:36
Contributor

Copilot AI left a comment


Pull Request Overview

This PR addresses the issue of duplicate validation results from validate.validate() by introducing deduplication logic.

  • Introduces df_results and df_result_ids to track unique ValidationResult objects.
  • Updates the loop to yield only unique validation errors.
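The mechanism can be sketched roughly as follows. The names df_results and df_result_ids come from the overview above; the exact loop shape is an assumption, not the verbatim diff:

```python
# Hypothetical sketch of identity-based deduplication.
# `df_result_ids` records the id() of every ValidationResult already yielded;
# `df_results` keeps a reference to each one so that its id() cannot be
# recycled by the garbage collector while iteration is still running.
def iter_unique(results):
    df_results = []
    df_result_ids = set()
    for result in results:
        if id(result) not in df_result_ids:
            df_results.append(result)
            df_result_ids.add(id(result))
            yield result
```

Deduplicating by object identity rather than equality is what makes this safe for result objects that may compare equal while still being distinct findings.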

@candleindark
Member Author

@yarikoptic Need to have this one approved quick. It is needed for #1599.

@yarikoptic
Member

"not sure", since this would just address the symptom, potentially hiding away some code inefficiency resulting in duplicate records. Did you look why we get duplicate records?

@candleindark
Member Author

> "not sure", since this would just address the symptom, potentially hiding away some code inefficiency resulting in duplicate records. Did you look why we get duplicate records?

There is some inefficiency, but it is not terrible, and it is not the result of a bug but somewhat by design. Let's look at the code without the changes in this PR.

By NOT terrible I mean that no validation is repeated and no duplicate ValidationResult objects are actually created. The duplication results from repeated inclusions of the same ValidationResult objects at

dandi-cli/dandi/validate.py

Lines 171 to 176 in 4f466a8

for df in find_dandi_files(
    p, dandiset_path=dandiset_path, allow_all=allow_any_path
):
    yield from df.get_validation_errors(
        schema_version=schema_version, devel_debug=devel_debug
    )

df here is a DandiFile object that represents a "file" in the dandiset. For a dandiset that is also a BIDS dataset, df can be a BIDSDatasetDescriptionAsset or BIDSAsset.

When df is a BIDSDatasetDescriptionAsset, its get_validation_errors() method returns all the ValidationResults pertaining to the dataset and all individual assets,

def get_validation_errors(
    self,
    schema_version: str | None = None,
    devel_debug: bool = False,
) -> list[ValidationResult]:
    self._validate()
    assert self._dataset_errors is not None
    if self._asset_errors is not None:
        return self._dataset_errors + [
            i for j in self._asset_errors.values() for i in j
        ]
    else:
        return self._dataset_errors

When df is a BIDSAsset, its get_validation_errors() method returns ValidationResults pertaining to the dataset and the particular asset through its attached BIDSDatasetDescriptionAsset object,

def get_validation_errors(
    self,
    schema_version: str | None = None,
    devel_debug: bool = False,
) -> list[ValidationResult]:
    return self.bids_dataset_description.get_asset_errors(self)

def get_asset_errors(self, asset: BIDSAsset) -> list[ValidationResult]:
    """:meta private:"""
    self._validate()
    errors: list[ValidationResult] = []
    if self._dataset_errors:
        errors.extend(self._dataset_errors)
    assert self._asset_errors is not None
    errors.extend(self._asset_errors[asset.bids_path])
    return errors
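Taken together, the two methods overlap: dataset-level errors are returned once through the BIDSDatasetDescriptionAsset and again for every BIDSAsset. A toy model of that overlap (class and variable names are illustrative, not dandi-cli code):

```python
# Toy model of the overlap described above; names are illustrative only.
class Err:
    def __init__(self, msg):
        self.msg = msg

dataset_errors = [Err("missing-license")]           # dataset-level results
asset_errors = {"sub-01/a.nii": [Err("bad-name")]}  # per-asset results

# What the description asset returns: dataset errors + all assets' errors
desc_results = dataset_errors + [e for errs in asset_errors.values() for e in errs]
# What each asset returns: the SAME dataset error objects + its own errors
asset_results = dataset_errors + asset_errors["sub-01/a.nii"]

combined = desc_results + asset_results
# The single "missing-license" object now occurs twice in `combined`,
# although it was created only once.
```

No validation runs twice and no object is duplicated; the same object is simply included in more than one returned list, which is exactly what identity-based deduplication in the caller addresses.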

The reason that there is no repetition in validation, and no re-creation of the same ValidationResult objects, is that all BIDSAsset objects share an instance of BIDSDatasetDescriptionAsset,

bids_dataset_description_ref: weakref.ref[BIDSDatasetDescriptionAsset]

and BIDSDatasetDescriptionAsset._validate() actually only does validation once in its lifetime:
def _validate(self) -> None:
    with self._lock:
        if self._dataset_errors is None:
            # Import here to avoid circular import
            from dandi.validate import validate_bids

            results = validate_bids(self.bids_root)
            self._dataset_errors: list[ValidationResult] = []
            self._asset_errors: dict[str, list[ValidationResult]] = defaultdict(
                list
            )
            # Don't apply eta-reduction to the lambda, as mypy needs to be
            # assured that defaultdict's argument takes no parameters.
            self._asset_metadata = defaultdict(
                lambda: BareAsset.model_construct()  # type: ignore[call-arg]
            )
            for result in results:
                if result.id in BIDS_ASSET_ERRORS:
                    assert result.path
                    bids_path = result.path.relative_to(self.bids_root).as_posix()
                    self._asset_errors[bids_path].append(result)
                elif result.id in BIDS_DATASET_ERRORS:
                    self._dataset_errors.append(result)
                elif result.id == "BIDS.MATCH":
                    assert result.path
                    bids_path = result.path.relative_to(self.bids_root).as_posix()
                    assert result.metadata is not None
                    self._asset_metadata[bids_path] = prepare_metadata(
                        result.metadata
                    )
                    self._bids_version = result.origin.standard_version
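The validate-once caching pattern used by _validate() can be illustrated in isolation (a toy class with assumed names, not dandi-cli code):

```python
import threading

# Minimal sketch of the "validate once, cache forever" pattern above.
# `run_count` tracks how often the expensive step actually runs.
class CachedValidator:
    def __init__(self):
        self._lock = threading.Lock()
        self._errors = None
        self.run_count = 0

    def _validate(self):
        with self._lock:
            if self._errors is None:      # only on the first call
                self.run_count += 1       # stands in for the real validation
                self._errors = ["E1", "E2"]

    def get_errors(self):
        self._validate()
        return self._errors
```

Every caller, whether the description asset or an individual asset, hits the same cached list after the first call, so validation itself is never repeated.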

In sum, the repetition is only in the inclusion of the same ValidationResult objects (in validate.validate()), and it results from how files in a dandiset are currently represented.

@yarikoptic yarikoptic added UX patch Increment the patch version when merged labels Apr 25, 2025
@yarikoptic yarikoptic merged commit a774355 into dandi:master Apr 25, 2025
29 of 30 checks passed
@github-actions

🚀 PR was released in 0.68.1 🚀


Labels

patch (Increment the patch version when merged), released, UX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Duplicate errors in dandi validate output

3 participants